AITopics | best system

What are the best Systems? New Perspectives on NLP Benchmarking

Neural Information Processing Systems | Dec-24-2025, 23:23:11 GMT

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems' performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.
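The contrast the abstract draws between mean aggregation and rank aggregation can be illustrated with a small sketch. The scores below are invented, and the Borda-style sum of per-task ranks is only one simple social-choice rule chosen for illustration; the abstract does not state which rule the paper uses, so this should not be read as the authors' exact procedure.

```python
# Hypothetical scores (higher is better). Tasks 0 and 1 are on a 0-1 scale,
# task 2 is on a 0-100 scale, so it dominates any plain average.
scores = {
    "A": [0.90, 0.85, 30.0],
    "B": [0.80, 0.80, 35.0],
    "C": [0.10, 0.20, 90.0],
}


def mean_ranking(scores):
    """Rank systems by the plain average of their raw scores (best first)."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def borda_ranking(scores):
    """Rank systems by summing per-task rank points (best first).

    On each task the worst system gets 0 points and the best gets n - 1.
    Ties are not handled specially; equal totals keep insertion order.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {name: 0 for name in systems}
    for task in range(n_tasks):
        worst_first = sorted(systems, key=lambda name: scores[name][task])
        for pts, name in enumerate(worst_first):
            points[name] += pts
    return sorted(systems, key=points.get, reverse=True)


print(mean_ranking(scores))   # ['C', 'B', 'A']
print(borda_ranking(scores))  # ['A', 'B', 'C']
```

In this toy case system C wins under the mean only because task 2 lives on a larger scale, while the rank-based rule depends only on the per-task orderings and is therefore unaffected by rescaling a metric.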

best system, name change, new perspective, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.39)

What are the best Systems? New Perspectives on NLP Benchmarking

Neural Information Processing Systems | Jan-18-2025, 12:29:11 GMT

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems' performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks.

new perspective, nlp benchmarking, procedure, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Reviews: Decoding with Value Networks for Neural Machine Translation

Neural Information Processing Systems | Oct-7-2024, 16:14:00 GMT

This paper addresses one of the limitations of NMT, the so-called exposure bias, which results from the fact that each word is chosen greedily. For this, the authors build on standard reinforcement learning techniques and try to predict, for each outgoing transition of a given state, the expected reward that will be achieved if the system takes this transition. The article is overall very clear and the proposed ideas quite appealing, even if many of the decisions seem quite ad hoc (e.g. ...). More importantly, several implementation "details" are not specified. For instance, in Equation (6), the BLEU function is defined at the sentence level, while the actual BLEU metric is defined at the corpus level.

neural machine translation, prediction, value network, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

What are the best systems? New perspectives on NLP Benchmarking

#artificialintelligence | Feb-10-2022, 00:28:34 GMT

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems' performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions.

best system, nlp benchmarking, procedure, (2 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence (0.60)

What are the best systems? New perspectives on NLP Benchmarking

Colombo, Pierre, Noiry, Nathan, Irurozki, Ekhine, Clemencon, Stephan

arXiv.org Artificial Intelligence | Feb-10-2022

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems' performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.

aggregation, arxiv preprint arxiv, procedure, (13 more...)

arXiv.org Artificial Intelligence

2202.03799

Country:

Europe > France (0.04)
North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.86)

Selecting the Best Optimizing System

Si, Nian, Zheng, Zeyu

arXiv.org Machine Learning | Jan-9-2022

We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the system's expected performance. An SBOS problem compares different systems based on their expected performances under their own optimally chosen decisions to select the best, without advance knowledge of the expected performances of the systems or of the optimizing decision inside each system. We design easy-to-implement algorithms that adaptively choose a system and a decision at which to evaluate the noisy system performance, sequentially eliminate inferior systems, and eventually recommend a system as the best after spending a user-specified budget. The proposed algorithms integrate the stochastic gradient descent method and the sequential elimination method to simultaneously exploit the structure inside each system and make comparisons across systems. For the proposed algorithms, we prove exponential rates of convergence to zero for the probability of false selection, as the budget grows to infinity. We conduct three numerical examples that represent three practical cases of SBOS problems. Our proposed algorithms demonstrate consistent and stronger performances in terms of the probability of false selection over benchmark algorithms under a range of problem settings and sampling budgets.
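The combination of stochastic gradient steps inside each system with sequential elimination across systems can be sketched as below. This is a toy illustration under stated assumptions, not the authors' algorithm: the quadratic systems, the noise level, the 1/(k+1) step size, the fixed per-round sample counts, and the names `make_system` and `select_best` are all hypothetical, and the budget is fixed implicitly by those counts rather than allocated adaptively.

```python
import random


def make_system(peak, opt, noise=0.1):
    """Hypothetical system whose expected performance at decision x is
    peak - (x - opt)**2; both values and gradients are observed with noise."""
    def value(x, rng):
        return peak - (x - opt) ** 2 + rng.gauss(0.0, noise)

    def grad(x, rng):
        return -2.0 * (x - opt) + rng.gauss(0.0, noise)

    return value, grad


def select_best(systems, steps_per_round=50, evals_per_round=20, seed=0):
    """Run noisy gradient ascent inside each surviving system, then drop the
    system with the lowest estimated performance; repeat until one remains."""
    rng = random.Random(seed)
    x = {name: 0.0 for name in systems}      # current decision per system
    steps = {name: 0 for name in systems}    # gradient steps taken so far
    alive = list(systems)
    while len(alive) > 1:
        estimates = {}
        for name in alive:
            value, grad = systems[name]
            for _ in range(steps_per_round):
                steps[name] += 1
                # Decaying step size 1 / (k + 1), an assumed schedule.
                x[name] += grad(x[name], rng) / (steps[name] + 1)
            samples = [value(x[name], rng) for _ in range(evals_per_round)]
            estimates[name] = sum(samples) / evals_per_round
        alive.remove(min(alive, key=estimates.get))
    best = alive[0]
    return best, x[best]


systems = {
    "A": make_system(peak=1.0, opt=2.0),
    "B": make_system(peak=3.0, opt=-1.0),
    "C": make_system(peak=2.0, opt=0.5),
}
best, decision = select_best(systems)
print(best, round(decision, 2))
```

With peaks this well separated relative to the noise, the sketch should recommend system B with a decision close to -1; closer peaks would require a larger budget to keep the probability of false selection low.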

algorithm, optimization, selection, (14 more...)

arXiv.org Machine Learning

2201.03065

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Automated Question Answering System for Community-Based Questions

Pithyaachariyakul, Chanin (San Francisco State University) | Kulkarni, Anagha (San Francisco State University)

AAAI Conferences | Feb-8-2018

Answer (Y!A), and Quora, indicate that for certain information needs, users prefer receiving focused answers to their questions, rather than a list of URLs from search results. This trend has sparked a rich area of investigation at the intersection of Information Retrieval (IR), Natural Language Processing (NLP), and Machine Learning (ML): Automated Question Answering (QA).

artificial intelligence, natural language, question answering, (16 more...)

AAAI Conferences

Thirty-Second AAAI Conference on Artificial Intelligence

Country: North America > United States > California > San Francisco County > San Francisco (0.15)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.31)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

MIT's new AI could eliminate video buffering woes

Daily Mail - Science & tech | Aug-14-2017, 22:40:04 GMT

MIT researchers have discovered a way to improve video streaming by reducing buffering times and pixelation. A new AI developed at the university's Computer Science and Artificial Intelligence Laboratory uses machine learning to pick different algorithms depending on network conditions. In doing so, the AI, called Pensieve, has been shown to deliver a higher-quality streaming experience with less buffering than existing systems. Streaming sites use adaptive bitrate (ABR) algorithms to determine which resolution videos will play at. Instead of sending a video to your computer in one complete piece, a streaming site breaks it up into smaller pieces and sends them sequentially.
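For readers unfamiliar with ABR, the chunk-by-chunk choice of video quality can be sketched with a simple throughput-based rule. This is a generic hand-written baseline for illustration, not Pensieve, which the article says uses machine learning; the bitrate ladder, the safety factor, and the function names are assumptions.

```python
# Hypothetical bitrate ladder in kbit/s, from lowest to highest quality.
BITRATES_KBPS = [300, 750, 1200, 2850, 4300]


def choose_bitrate(throughput_kbps, ladder=BITRATES_KBPS, safety=0.8):
    """Pick the highest bitrate that fits within a safety margin of the
    measured throughput; fall back to the lowest rung if none fits."""
    budget = safety * throughput_kbps
    affordable = [rate for rate in ladder if rate <= budget]
    return max(affordable) if affordable else min(ladder)


def plan(throughputs_kbps):
    """Choose a bitrate for each chunk from its measured throughput."""
    return [choose_bitrate(t) for t in throughputs_kbps]


print(plan([5000, 1000, 200, 4000]))  # [2850, 750, 300, 2850]
```

A fixed rule like this reacts only to the most recent throughput measurement, which is the kind of hand-tuned behaviour a learned policy is meant to improve on.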

artificial intelligence, machine learning, video, (16 more...)

Daily Mail - Science & tech

Country: North America > United States > California > Los Angeles County > Los Angeles (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.98)
